Optimizing Transformations of Stencil Operations for Parallel Cache-based Architectures
Authors
Abstract
This paper describes a new technique for optimizing serial and parallel stencil and stencil-like operations for cache-based architectures. The technique takes advantage of the semantic knowledge implicit in stencil-like computations. It is implemented as a source-to-source program transformation; because of its specificity, it could not be expected of a conventional compiler. Empirical results demonstrate a uniform factor-of-two speedup. The experiments clearly show the benefits of this technique to be a consequence, as intended, of the reduction in cache misses. The test codes are based on a 5-point stencil obtained by discretizing the Poisson equation, applied to a two-dimensional uniform grid with the Jacobi method as the iterative solver. Results are presented for 1-D and 2-D tiling on a single processor. For the parallel case, both blocking and non-blocking communication have been tested; however, the parallel case is not discussed here.
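As a rough illustration of the test setup the abstract describes, the sketch below shows one Jacobi sweep of the 5-point stencil on a 2-D uniform grid, first untiled and then with the interior traversed tile by tile. It is not the paper's transformation, whose details the abstract does not give; it is ordinary loop tiling of a single sweep, written in pure Python for readability rather than speed. The function names, the tile sizes, and the sign convention (-Δu = f with grid spacing h) are assumptions made for this example.

```python
def jacobi_sweep(u, f, h):
    """One untiled Jacobi sweep of the 5-point stencil (assumes -laplace(u) = f)."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]  # boundary values are carried over unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                + h * h * f[i][j])
    return new


def jacobi_sweep_tiled(u, f, h, tile_i=8, tile_j=8):
    """The same sweep with the interior visited in tile_i x tile_j blocks.

    Hypothetical tile sizes; setting tile_i to the grid height gives
    column strips, i.e. a 1-D tiling.
    """
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for ii in range(1, n - 1, tile_i):            # loop over tile rows
        for jj in range(1, m - 1, tile_j):        # loop over tile columns
            for i in range(ii, min(ii + tile_i, n - 1)):
                for j in range(jj, min(jj + tile_j, m - 1)):
                    new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                        + u[i][j - 1] + u[i][j + 1]
                                        + h * h * f[i][j])
    return new
```

Because each point is computed from the old grid with the same arithmetic, both functions return identical results; only the order in which memory is visited changes, which is where any cache benefit would come from in a compiled language.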
Similar resources
Optimizing Transformations of Stencil Operations for Parallel Object-Oriented Scientific Frameworks on Cache-Based Architectures
High-performance scientific computing relies increasingly on high-level large-scale object-oriented software frameworks to manage both algorithmic complexity and the complexities of parallelism: distributed data management, process management, inter-process communication, and load balancing. This encapsulation of data management, together with the prescribed semantics of a typical fundamental c...
Domain-Specific Optimization of Two Jacobi Smoother Kernels and Their Evaluation in the ECM Performance Model
Our aim is to apply program transformations to stencil codes in order to yield the highest possible performance. We recognize memory bandwidth as a major limitation in stencil code performance. We conducted a study in which we applied optimizing transformations to two Jacobi smoother kernels: one 3D 1st-order 7-point stencil and one 3D 3rd-order 19-point stencil. To obtain high performance, the...
Loop Transformations for Performance and Message Latency Hiding in Parallel Object-Oriented Frameworks (Extended Abstract)
Application codes reliably achieve performance far less than the advertised capabilities of existing architectures, and this problem is worsening with increasingly parallel machines. For large-scale numerical applications, stencil operations often impose the greater part of the computational cost, and the primary sources of inefficiency are the costs of message passing and poor cache utilizat...
Improving Scalability with Loop Transformations and
Application codes reliably achieve performance far less than the advertised capabilities of existing architectures, and this problem is worsening with increasingly parallel machines. For large-scale numerical applications, stencil operations often impose the greater part of the computational cost, and the primary sources of inefficiency are the costs of message passing and poor cache utilizat...
In-Core Optimization of High-Order Stencil Computations
In this paper, we apply in-core optimization techniques to high-order stencil computations, including: (1) cache blocking for efficient L2 cache use; (2) register blocking and data-level parallelism via single-instruction multiple-data (SIMD) techniques to increase L1 cache efficiency; and (3) software prefetching techniques. Our generic approach is tested with a kernel extracted from a 6th-or...
Publication date: 1999